Abstract

We address the problem of estimating the difference between two probability den- 
sities. A naive approach is a two-step procedure of first estimating two densities 
separately and then computing their difference. However, such a two-step procedure 
does not necessarily work well because the first step is performed without regard to 
the second step and thus a small error incurred in the first stage can cause a big 
error in the second stage. In this paper, we propose a single-shot procedure for di- 
rectly estimating the density difference without separately estimating two densities. 
We derive a non-parametric finite-sample error bound for the proposed single-shot 
density-difference estimator and show that it achieves the optimal convergence rate. 
The usefulness of the proposed method is also demonstrated experimentally. 

Keywords

density difference, $L^2$-distance, robustness, Kullback-Leibler divergence, kernel
density estimation.
1 Introduction

When estimating a quantity consisting of two elements, a two-stage approach of first 
estimating the two elements separately and then approximating the target quantity based 
on the estimates of the two elements often performs poorly, because the first stage is 
carried out without regard to the second stage and thus a small error incurred in the first 
stage can cause a big error in the second stage. To cope with this problem, it would be 
more appropriate to directly estimate the target quantity in a single-shot process without 
separately estimating the two elements. 

A seminal example that follows this general idea is pattern recognition by the support 
vector machine (Boser et al., 1992; Cortes & Vapnik, 1995; Vapnik, 1998): Instead of 
separately estimating two probability distributions of patterns for positive and negative 
classes, the support vector machine directly learns the boundary between the two classes 
that is sufficient for pattern recognition. More recently, a problem of estimating the ratio 
of two probability densities was tackled in a similar fashion (Qin, 1998; Sugiyama et al., 
2008; Gretton et al., 2009; Kanamori et al., 2009; Nguyen et al., 2010; Kanamori et al., 
2012; Sugiyama et al., 2012b; Sugiyama et al., 2012a): The ratio of two probability densi- 
ties is directly estimated without going through separate estimation of the two probability 
densities. 

In this paper, we further explore this line of research, and propose a method for di- 
rectly estimating the difference between two probability densities in a single-shot process. 
Density differences are useful for various purposes such as class-balance estimation under 
class-prior change (Saerens et al., 2002; Du Plessis & Sugiyama, 2012), change-point de- 
tection in time series (Kawahara & Sugiyama, 2012; Liu et al., 2012), feature extraction 
(Torkkola, 2003), video-based event detection (Matsugu et al., 2011), flow cytometric 
data analysis (Duong et al., 2009), ultrasound image segmentation (Liu et al., 2010), 
non-rigid image registration (Atif et al., 2003), and image-based target recognition (Gray 
& Principe, 2010).

For this density-difference estimation problem, we propose a single-shot method, called
the least-squares density-difference (LSDD) estimator, that directly estimates the density 
difference without separately estimating two densities. LSDD is derived within a frame- 
work of kernel least-squares estimation, and its solution can be computed analytically in a 
computationally efficient and stable manner. Furthermore, LSDD is equipped with cross- 
validation, and thus all tuning parameters such as the kernel width and the regularization 
parameter can be systematically and objectively optimized. We derive a finite-sample 
error bound for the LSDD estimator in a non-parametric setup and show that it achieves 
the optimal convergence rate. 

We also apply LSDD to $L^2$-distance estimation and show that it is more accurate than
the difference of KDEs, which tends to severely under-estimate the $L^2$-distance (Anderson
et al., 1994). Compared with the Kullback-Leibler (KL) divergence (Kullback & Leibler,
1951), the $L^2$-distance is more robust against outliers (Basu et al., 1998; Scott, 2001;
Besbeas & Morgan, 2004).

Finally, we experimentally demonstrate the usefulness of LSDD in semi-supervised 
class-prior estimation and unsupervised change detection. 

The rest of this paper is structured as follows. In Section 2, we derive the LSDD 
method and investigate its theoretical properties. In Section 3, we show how the
$L^2$-distance can be approximated by LSDD. In Section 4, we illustrate the numerical behavior
of LSDD. Finally, we conclude in Section 5. 

2 Density-Difference Estimation 

In this section, we propose a single-shot method for estimating the difference between two
probability densities from samples, and analyze its theoretical properties.

2.1 Problem Formulation and Naive Approach

First, we formulate the problem of density-difference estimation. 

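Suppose that we are given two sets of independent and identically distributed samples
$\mathcal{X} := \{x_i\}_{i=1}^{n}$ and $\mathcal{X}' := \{x'_{i'}\}_{i'=1}^{n'}$ drawn from
probability distributions on $\mathbb{R}^d$ with densities $p(x)$ and $p'(x)$, respectively:

$$\mathcal{X} := \{x_i\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} p(x), \qquad
\mathcal{X}' := \{x'_{i'}\}_{i'=1}^{n'} \overset{\text{i.i.d.}}{\sim} p'(x).$$

Our goal is to estimate the difference $f(x)$ between $p(x)$ and $p'(x)$ from the samples
$\mathcal{X}$ and $\mathcal{X}'$:

$$f(x) := p(x) - p'(x).$$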

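A naive approach to density-difference estimation is to use kernel density estimators
(KDEs) (Silverman, 1986). For Gaussian kernels, the KDE-based density-difference
estimator is given by

$$\hat{f}(x) := \hat{p}(x) - \hat{p}'(x),$$

where

$$\hat{p}(x) := \frac{1}{n (2\pi\sigma^2)^{d/2}} \sum_{i=1}^{n} \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right), \qquad
\hat{p}'(x) := \frac{1}{n' (2\pi\sigma'^2)^{d/2}} \sum_{i'=1}^{n'} \exp\left(-\frac{\|x - x'_{i'}\|^2}{2\sigma'^2}\right).$$

The Gaussian widths $\sigma$ and $\sigma'$ may be determined based on cross-validation
(Härdle et al., 2004).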
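The two-step KDE baseline described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming samples are stored as arrays of shape (n, d); the function names `gaussian_kde` and `kde_density_difference` and the fixed widths are hypothetical choices for this sketch, and the widths would normally be tuned by cross-validation rather than set by hand.

```python
import numpy as np

def gaussian_kde(x_query, samples, sigma):
    """Gaussian KDE at x_query (m, d), built from samples (n, d) with width sigma."""
    d = samples.shape[1]
    # Pairwise squared distances between query points and samples, shape (m, n).
    sq_dist = ((x_query[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.exp(-sq_dist / (2.0 * sigma ** 2)).mean(axis=1) / norm

def kde_density_difference(x_query, x, x_prime, sigma, sigma_prime):
    """Two-step estimate: estimate each density separately, then subtract."""
    return gaussian_kde(x_query, x, sigma) - gaussian_kde(x_query, x_prime, sigma_prime)

# Illustrative data (assumed, not taken from the experiments in this paper).
rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=(200, 1))        # samples from p
x_prime = rng.normal(0.0, 1.0, size=(200, 1))  # samples from p'
grid = np.linspace(-5.0, 5.0, 11)[:, None]
f_kde = kde_density_difference(grid, x, x_prime, sigma=0.3, sigma_prime=0.3)
```

Each density estimate is smoothed independently here, which is the property the next paragraph criticizes.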



However, we argue that the KDE-based density-difference estimator is not the best 
approach because of its two-step nature: Small estimation error in each density estimate 
can cause a big error in the final density-difference estimate. More intuitively, good 
density estimators tend to be smooth and thus a density-difference estimator obtained 
from such smooth density estimators tends to be over-smoothed (Hall & Wand, 1988; 
Anderson et al., 1994, see also numerical experiments in Section 4.1.1). 

To overcome this weakness, we give a single-shot procedure of directly estimating the
density difference $f(x)$ without separately estimating the densities $p(x)$ and $p'(x)$.

2.2 Least-Squares Density-Difference Estimation 

In our proposed approach, we fit a density-difference model $g(x)$ to the true
density-difference function $f(x)$ under the squared loss:

$$\operatorname*{argmin}_{g} \int \left(g(x) - f(x)\right)^2 \mathrm{d}x. \qquad (1)$$

We use the following linear-in-parameter model as $g(x)$:

$$g(x) = \sum_{\ell=1}^{b} \theta_\ell \psi_\ell(x) = \theta^\top \psi(x), \qquad (2)$$

where $b$ denotes the number of basis functions, $\psi(x) = (\psi_1(x), \ldots, \psi_b(x))^\top$ is a
$b$-dimensional basis function vector, $\theta = (\theta_1, \ldots, \theta_b)^\top$ is a $b$-dimensional
parameter vector, and $^\top$ denotes the transpose. In practice, we use the following
non-parametric Gaussian kernel model as $g(x)$:

$$g(x) = \sum_{\ell=1}^{n+n'} \theta_\ell \exp\left(-\frac{\|x - c_\ell\|^2}{2\sigma^2}\right), \qquad (3)$$

where $(c_1, \ldots, c_n, c_{n+1}, \ldots, c_{n+n'}) := (x_1, \ldots, x_n, x'_1, \ldots, x'_{n'})$ are Gaussian
kernel centers. If $n + n'$ is large, we may use only a subset of
$x_1, \ldots, x_n, x'_1, \ldots, x'_{n'}$ as Gaussian kernel centers.

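For the model (2), the optimal parameter $\theta^*$ is given by

$$\begin{aligned}
\theta^* &:= \operatorname*{argmin}_{\theta} \int \left(g(x) - f(x)\right)^2 \mathrm{d}x \\
&= \operatorname*{argmin}_{\theta} \left[ \int g(x)^2 \mathrm{d}x - 2 \int g(x) f(x) \mathrm{d}x \right] \\
&= \operatorname*{argmin}_{\theta} \left[ \theta^\top H \theta - 2 h^\top \theta \right] \\
&= H^{-1} h,
\end{aligned}$$

where $H$ is the $b \times b$ matrix and $h$ is the $b$-dimensional vector defined as

$$H := \int \psi(x) \psi(x)^\top \mathrm{d}x, \qquad
h := \int \psi(x) p(x) \mathrm{d}x - \int \psi(x') p'(x') \mathrm{d}x'.$$

Note that, for the Gaussian kernel model (3), the integral in $H$ can be computed
analytically as

$$H_{\ell,\ell'} = \int \exp\left(-\frac{\|x - c_\ell\|^2}{2\sigma^2}\right) \exp\left(-\frac{\|x - c_{\ell'}\|^2}{2\sigma^2}\right) \mathrm{d}x
= (\pi\sigma^2)^{d/2} \exp\left(-\frac{\|c_\ell - c_{\ell'}\|^2}{4\sigma^2}\right),$$

where $d$ denotes the dimensionality of $x$.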
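As a sanity check on the analytic form of the kernel integral, the following sketch compares the closed-form value $(\pi\sigma^2)^{d/2}\exp(-\|c_\ell - c_{\ell'}\|^2/(4\sigma^2))$ with a plain Riemann sum for $d = 1$. The function names, the two centers, and the width are arbitrary values assumed for illustration only.

```python
import numpy as np

def gaussian_overlap_closed_form(c1, c2, sigma, d=1):
    """Closed-form integral of the product of two Gaussian kernels of width sigma."""
    sq = np.sum((np.atleast_1d(c1) - np.atleast_1d(c2)) ** 2)
    return (np.pi * sigma ** 2) ** (d / 2.0) * np.exp(-sq / (4.0 * sigma ** 2))

def gaussian_overlap_numeric_1d(c1, c2, sigma, half_width=20.0, n_grid=200001):
    """Riemann-sum approximation of the same integral for d = 1."""
    grid = np.linspace(-half_width, half_width, n_grid)
    dx = grid[1] - grid[0]
    integrand = (np.exp(-(grid - c1) ** 2 / (2.0 * sigma ** 2))
                 * np.exp(-(grid - c2) ** 2 / (2.0 * sigma ** 2)))
    return integrand.sum() * dx

closed = gaussian_overlap_closed_form(-0.3, 1.1, sigma=0.7)
numeric = gaussian_overlap_numeric_1d(-0.3, 1.1, sigma=0.7)
```

The two values should agree to many digits, because the integrand is smooth and decays rapidly inside the integration window.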

Replacing the expectations in $h$ by empirical estimators and adding an $\ell_2$-regularizer
to the objective function, we arrive at the following optimization problem:

$$\hat{\theta} := \operatorname*{argmin}_{\theta} \left[ \theta^\top H \theta - 2 \hat{h}^\top \theta + \lambda \theta^\top \theta \right], \qquad (4)$$

where $\lambda$ ($\ge 0$) is the regularization parameter and $\hat{h}$ is the $b$-dimensional vector
defined as

$$\hat{h} := \frac{1}{n} \sum_{i=1}^{n} \psi(x_i) - \frac{1}{n'} \sum_{i'=1}^{n'} \psi(x'_{i'}).$$

Taking the derivative of the objective function in Eq.(4) and equating it to zero, we can
obtain the solution $\hat{\theta}$ analytically as

$$\hat{\theta} = (H + \lambda I_b)^{-1} \hat{h},$$

where $I_b$ denotes the $b$-dimensional identity matrix.

Finally, a density-difference estimator $\hat{f}(x)$ is given as

$$\hat{f}(x) = \hat{\theta}^\top \psi(x). \qquad (5)$$

We call this the least-squares density-difference (LSDD) estimator.
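The analytic solution above translates directly into code. The following is a minimal NumPy sketch, not the MATLAB implementation referred to later in the paper. It assumes all samples are used as kernel centers unless a subset is passed in, and the names `lsdd_fit` and `lsdd_predict` are hypothetical.

```python
import numpy as np

def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a (m, d) and b (k, d)."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def lsdd_fit(x, x_prime, sigma, lam, centers=None):
    """Solve Eq.(4) for the Gaussian kernel model; returns (theta_hat, h_hat, H, centers)."""
    if centers is None:
        centers = np.vstack([x, x_prime])  # all samples act as kernel centers
    d = centers.shape[1]
    # Analytic kernel integral H.
    H = (np.pi * sigma ** 2) ** (d / 2.0) * np.exp(-sq_dists(centers, centers) / (4.0 * sigma ** 2))
    # Empirical estimate of h: difference of the sample means of the basis functions.
    psi_x = np.exp(-sq_dists(x, centers) / (2.0 * sigma ** 2))
    psi_xp = np.exp(-sq_dists(x_prime, centers) / (2.0 * sigma ** 2))
    h_hat = psi_x.mean(axis=0) - psi_xp.mean(axis=0)
    # Regularized solution (H + lam * I)^{-1} h_hat, via a linear solve.
    theta_hat = np.linalg.solve(H + lam * np.eye(len(centers)), h_hat)
    return theta_hat, h_hat, H, centers

def lsdd_predict(x_query, theta_hat, centers, sigma):
    """Evaluate the density-difference estimate of Eq.(5) at x_query (m, d)."""
    return np.exp(-sq_dists(x_query, centers) / (2.0 * sigma ** 2)) @ theta_hat
```

A linear solve is used instead of forming the inverse explicitly, which is the usual numerically safer choice; the values of `sigma` and `lam` are left to the caller and would be chosen by the cross-validation procedure of Section 2.4.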

2.3 Theoretical Analysis

Here, we theoretically investigate the behavior of the LSDD estimator.

2.3.1 Parametric Convergence

First, we consider a linear parametric setup where basis functions in our density-difference 
model (2) are fixed. 
Suppose that $n/(n + n')$ converges to $\eta \in [0, 1]$. Then the central limit theorem (Rao,
1965) asserts that $\sqrt{\frac{n n'}{n + n'}}\,(\hat{\theta} - \theta^*)$ converges in law to the normal
distribution with mean $0$ and covariance matrix

$$H^{-1} \left( (1 - \eta) V_p + \eta V_{p'} \right) H^{-1},$$

where $V_p$ denotes the covariance matrix of $\psi(x)$ under the probability density $p(x)$:

$$V_p := \int \left(\psi(x) - \bar{\psi}_p\right) \left(\psi(x) - \bar{\psi}_p\right)^\top p(x) \mathrm{d}x, \qquad (6)$$

and $\bar{\psi}_p$ denotes the expectation of $\psi(x)$ under the probability density $p(x)$:

$$\bar{\psi}_p := \int \psi(x) p(x) \mathrm{d}x.$$

This result implies that the LSDD estimator has asymptotic normality with asymptotic
order $\sqrt{1/n + 1/n'}$, which is the optimal convergence rate in the parametric setup.

2.3.2 Non-Parametric Error Bound 

Next, we consider a non-parametric setup where a density-difference function is learned
in a Gaussian reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950).

Let $\mathcal{H}_\gamma$ be the Gaussian RKHS with width $\gamma$:

$$k_\gamma(x, x') = \exp\left(-\frac{\|x - x'\|^2}{\gamma^2}\right).$$

Let us consider a slightly modified LSDD estimator that is more suitable for
non-parametric error analysis: For $n' = n$,

$$\hat{f} := \operatorname*{argmin}_{g \in \mathcal{H}_\gamma} \left[ \|g\|_{L^2(\mathbb{R}^d)}^2 - 2 \left( \frac{1}{n} \sum_{i=1}^{n} g(x_i) - \frac{1}{n} \sum_{i'=1}^{n} g(x'_{i'}) \right) + \lambda \|g\|_{\mathcal{H}_\gamma}^2 \right],$$

where $\|\cdot\|_{L^2(\mathbb{R}^d)}$ denotes the $L^2$-norm and $\|\cdot\|_{\mathcal{H}_\gamma}$ denotes the norm in RKHS $\mathcal{H}_\gamma$.

Then we can prove that, for all $\rho, \rho' > 0$, there exists a constant $K$ such that, for all
$\tau \ge 1$ and $n \ge 1$, the non-parametric LSDD estimator with appropriate choice of $\lambda$ and
$\gamma$ satisfies^1

$$\|\hat{f} - f\|_{L^2(\mathbb{R}^d)}^2 + \lambda \|\hat{f}\|_{\mathcal{H}_\gamma}^2 \le K \left( n^{-\frac{2\alpha}{2\alpha + d} + \rho} + \tau n^{-1 + \rho'} \right) \qquad (7)$$

with probability not less than $1 - 4e^{-\tau}$. Here, $d$ denotes the dimensionality of input vector
$x$, and $\alpha \ge 0$ denotes the regularity of Besov space to which the true density-difference
function $f$ belongs (smaller/larger $\alpha$ means $f$ is "less/more complex"; see Appendix A for
its precise definition). Because $n^{-\frac{2\alpha}{2\alpha + d}}$ is the optimal learning rate in this setup (Eberts
& Steinwart, 2011), the above result shows that the non-parametric LSDD estimator
achieves the optimal convergence rate.

It is known that, if the naive KDE with a Gaussian kernel is used for estimating a
probability density with regularity $\alpha > 2$, the optimal learning rate cannot be achieved
(Farrell, 1972; Silverman, 1986). To achieve the optimal rate by KDE, we should choose
a kernel specifically tailored to each regularity $\alpha$ (Parzen, 1962). But such a kernel is not
non-negative and it is difficult to implement in practice. On the other hand, our LSDD
estimator can always achieve the optimal learning rate with a Gaussian kernel without
regard to regularity $\alpha$.

^1 Because our theoretical result is highly technical, we only describe a rough idea here. A more
precise statement of the result and its complete proof are provided in Appendix A, where we utilize the
mathematical technique developed in Eberts and Steinwart (2011) for a regression problem.

2.4 Model Selection by Cross-Validation

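The above theoretical analyses showed the superiority of LSDD. However, the practical
performance of LSDD depends on the choice of models (i.e., the kernel width $\sigma$ and the
regularization parameter $\lambda$). Here, we show that the model can be optimized by
cross-validation (CV).

More specifically, we first divide the samples $\mathcal{X} = \{x_i\}_{i=1}^{n}$ and
$\mathcal{X}' = \{x'_{i'}\}_{i'=1}^{n'}$ into $T$ disjoint subsets $\{\mathcal{X}_t\}_{t=1}^{T}$ and
$\{\mathcal{X}'_t\}_{t=1}^{T}$, respectively. Then we obtain a density-difference estimate $\hat{f}_t(x)$ from
$\mathcal{X} \backslash \mathcal{X}_t$ and $\mathcal{X}' \backslash \mathcal{X}'_t$ (i.e., all samples without $\mathcal{X}_t$ and $\mathcal{X}'_t$), and compute
its hold-out error for $\mathcal{X}_t$ and $\mathcal{X}'_t$ as

$$\mathrm{CV}^{(t)} := \int \hat{f}_t(x)^2 \mathrm{d}x - \frac{2}{|\mathcal{X}_t|} \sum_{x \in \mathcal{X}_t} \hat{f}_t(x) + \frac{2}{|\mathcal{X}'_t|} \sum_{x' \in \mathcal{X}'_t} \hat{f}_t(x'),$$

where $|\mathcal{X}|$ denotes the number of elements in the set $\mathcal{X}$. We repeat this hold-out validation
procedure for $t = 1, \ldots, T$, and compute the average hold-out error as

$$\mathrm{CV} := \frac{1}{T} \sum_{t=1}^{T} \mathrm{CV}^{(t)}.$$

Finally, we choose the model that minimizes CV.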
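The cross-validation procedure above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the held-out criterion uses the identity $\int \hat{f}_t(x)^2 \mathrm{d}x = \hat{\theta}_t^\top H \hat{\theta}_t$, which holds for the linear-in-parameter model, and the fold assignment, candidate grids, and function names (`lsdd_cv_score`, `lsdd_select`) are hypothetical choices for this sketch.

```python
import numpy as np

def _sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def _design(x, centers, sigma):
    """Gaussian basis functions evaluated at x, shape (len(x), len(centers))."""
    return np.exp(-_sq_dists(x, centers) / (2.0 * sigma ** 2))

def lsdd_cv_score(x, x_prime, sigma, lam, n_folds=5, seed=0):
    """Average hold-out error over n_folds folds (smaller is better)."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(x)) % n_folds          # balanced fold labels for X
    folds_p = rng.permutation(len(x_prime)) % n_folds  # balanced fold labels for X'
    scores = []
    for t in range(n_folds):
        x_tr, x_te = x[folds != t], x[folds == t]
        xp_tr, xp_te = x_prime[folds_p != t], x_prime[folds_p == t]
        centers = np.vstack([x_tr, xp_tr])
        d = centers.shape[1]
        H = (np.pi * sigma ** 2) ** (d / 2.0) * np.exp(-_sq_dists(centers, centers) / (4.0 * sigma ** 2))
        h_tr = _design(x_tr, centers, sigma).mean(axis=0) - _design(xp_tr, centers, sigma).mean(axis=0)
        theta = np.linalg.solve(H + lam * np.eye(len(centers)), h_tr)
        # Hold-out error: integral of f_t^2 minus twice the held-out mean difference.
        h_te = _design(x_te, centers, sigma).mean(axis=0) - _design(xp_te, centers, sigma).mean(axis=0)
        scores.append(theta @ H @ theta - 2.0 * (h_te @ theta))
    return float(np.mean(scores))

def lsdd_select(x, x_prime, sigmas, lams, n_folds=5):
    """Grid search: return the (sigma, lam) pair with the smallest CV score."""
    best = min((lsdd_cv_score(x, x_prime, s, l, n_folds), s, l) for s in sigmas for l in lams)
    return best[1], best[2]
```

The sketch requires at least `n_folds` samples in each set so that no fold is empty.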



A MATLAB® implementation of LSDD is available from

http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSDD/

(to be made public after acceptance).
3 $L^2$-Distance Estimation by LSDD

In this section, we consider the problem of approximating the $L^2$-distance between $p(x)$
and $p'(x)$,

$$L^2(p, p') := \int \left(p(x) - p'(x)\right)^2 \mathrm{d}x, \qquad (8)$$

from samples $\mathcal{X} := \{x_i\}_{i=1}^{n}$ and $\mathcal{X}' := \{x'_{i'}\}_{i'=1}^{n'}$ (see Section 2.1).



3.1 Basic Form 

For an equivalent expression

$$L^2(p, p') = \int f(x) p(x) \mathrm{d}x - \int f(x') p'(x') \mathrm{d}x',$$

if we replace $f(x)$ with an LSDD estimator $\hat{f}(x)$ and approximate the expectations by
empirical averages, the following $L^2$-distance estimator can be obtained:

$$L^2(p, p') \approx \hat{h}^\top \hat{\theta}. \qquad (9)$$

Similarly, for another expression

$$L^2(p, p') = \int f(x)^2 \mathrm{d}x,$$

replacing $f(x)$ with an LSDD estimator $\hat{f}(x)$ gives another $L^2$-distance estimator:

$$L^2(p, p') \approx \hat{\theta}^\top H \hat{\theta}. \qquad (10)$$
3.2 Reduction of Bias Caused by Regularization

Eq.(9) and Eq.(10) themselves give approximations to $L^2(p, p')$. Nevertheless, we argue
that the use of their combination, defined by

$$\hat{L}^2(\mathcal{X}, \mathcal{X}') := 2 \hat{h}^\top \hat{\theta} - \hat{\theta}^\top H \hat{\theta}, \qquad (11)$$

is more sensible. To explain the reason, let us consider a generalized $L^2$-distance estimator
of the following form:

$$\beta \hat{h}^\top \hat{\theta} + (1 - \beta) \hat{\theta}^\top H \hat{\theta}, \qquad (12)$$

where $\beta$ is a real scalar. If the regularization parameter $\lambda$ ($\ge 0$) is small, then Eq.(12)
can be expressed as

$$\beta \hat{h}^\top \hat{\theta} + (1 - \beta) \hat{\theta}^\top H \hat{\theta} = \hat{h}^\top H^{-1} \hat{h} - \lambda (2 - \beta) \hat{h}^\top H^{-2} \hat{h} + o_p(\lambda), \qquad (13)$$

where $o_p$ denotes the probabilistic order (its derivation is given in Appendix B).

Thus, the bias introduced by regularization (i.e., the second term in the right-hand
side of Eq.(13) that depends on $\lambda$) can be eliminated if $\beta = 2$, which yields Eq.(11). Note
that, if no regularization is imposed (i.e., $\lambda = 0$), both Eq.(9) and Eq.(10) yield $\hat{h}^\top H^{-1} \hat{h}$,
the first term in the right-hand side of Eq.(13).

Eq.(11) is actually equivalent to the negative of the optimal objective value of the
LSDD optimization problem without regularization (i.e., Eq.(4) with $\lambda = 0$). This can
be naturally interpreted through a lower bound of $L^2(p, p')$ obtained by Legendre-Fenchel
convex duality (Rockafellar, 1970):

$$L^2(p, p') = \sup_{g} \left[ 2 \left( \int g(x) p(x) \mathrm{d}x - \int g(x) p'(x) \mathrm{d}x \right) - \int g(x)^2 \mathrm{d}x \right],$$

where the supremum is attained at $g = f$. If the expectations are replaced by empirical
estimators and the linear-in-parameter model (2) is used as $g$, the above optimization
problem is reduced to the LSDD objective function without regularization (see Eq.(4)). Thus,
LSDD corresponds to approximately maximizing the above lower bound and Eq.(11) is
its maximum value.

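Through eigenvalue decomposition of $H$, we can show that

$$2 \hat{h}^\top \hat{\theta} - \hat{\theta}^\top H \hat{\theta} \ge \hat{h}^\top \hat{\theta} \ge \hat{\theta}^\top H \hat{\theta}.$$

Thus, our approximator (11) is not less than the plain approximators (9) and (10).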
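The three estimators and their ordering can be checked numerically. The sketch below computes Eq.(9), Eq.(10), and Eq.(11) from one LSDD fit; the data follow the one-dimensional Gaussian setup used later in Section 4.1.1, while the width `sigma=0.2`, the regularization `lam=0.1`, and the function name are assumed values chosen only for illustration.

```python
import numpy as np

def _sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def lsdd_l2_estimates(x, x_prime, sigma, lam):
    """Return the estimates of Eq.(11), Eq.(9), and Eq.(10), in that order."""
    centers = np.vstack([x, x_prime])
    d = centers.shape[1]
    H = (np.pi * sigma ** 2) ** (d / 2.0) * np.exp(-_sq_dists(centers, centers) / (4.0 * sigma ** 2))
    h_hat = (np.exp(-_sq_dists(x, centers) / (2.0 * sigma ** 2)).mean(axis=0)
             - np.exp(-_sq_dists(x_prime, centers) / (2.0 * sigma ** 2)).mean(axis=0))
    theta_hat = np.linalg.solve(H + lam * np.eye(len(centers)), h_hat)
    est_9 = h_hat @ theta_hat               # Eq.(9)
    est_10 = theta_hat @ H @ theta_hat      # Eq.(10)
    return 2.0 * est_9 - est_10, est_9, est_10  # Eq.(11) first

rng = np.random.default_rng(0)
std = (4.0 * np.pi) ** -0.5
x = rng.normal(0.5, std, size=(200, 1))        # samples from p
x_prime = rng.normal(0.0, std, size=(200, 1))  # samples from p'
est_11, est_9, est_10 = lsdd_l2_estimates(x, x_prime, sigma=0.2, lam=0.1)
```

Since $\hat{h} = (H + \lambda I_b)\hat{\theta}$, the gap between Eq.(9) and Eq.(10) equals $\lambda \|\hat{\theta}\|^2$, which is why the ordering holds for any $\lambda \ge 0$.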
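
3.3 Further Bias Correction

$\hat{h}^\top H^{-1} \hat{h}$, the first term in Eq.(13), is an essential part of the $L^2$-distance estimator
(11). However, it is actually a slightly biased estimator of the target quantity $h^\top H^{-1} h$
($= \theta^{*\top} H \theta^* = h^\top \theta^*$):

$$\mathbb{E}\left[\hat{h}^\top H^{-1} \hat{h}\right] = h^\top H^{-1} h + \operatorname{tr}\left( H^{-1} \left( \frac{1}{n} V_p + \frac{1}{n'} V_{p'} \right) \right), \qquad (14)$$

where $\mathbb{E}$ denotes the expectation over all samples $\mathcal{X} = \{x_i\}_{i=1}^{n}$ and
$\mathcal{X}' = \{x'_{i'}\}_{i'=1}^{n'}$, and $V_p$ and $V_{p'}$ are defined by Eq.(6) (its derivation is given in Appendix C).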

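The second term in the right-hand side of Eq.(14) is an estimation bias that is generally
non-zero. Thus, based on Eq.(14), we can construct a bias-corrected $L^2$-distance estimator
as

$$\widetilde{L}^2(\mathcal{X}, \mathcal{X}') := 2 \hat{h}^\top \hat{\theta} - \hat{\theta}^\top H \hat{\theta} - \operatorname{tr}\left( H^{-1} \left( \frac{1}{n} \hat{V}_p + \frac{1}{n'} \hat{V}_{p'} \right) \right), \qquad (15)$$

where $\hat{V}_p$ is an empirical estimator of covariance matrix $V_p$:

$$\hat{V}_p := \frac{1}{n} \sum_{i=1}^{n} \left(\psi(x_i) - \hat{\bar{\psi}}_p\right) \left(\psi(x_i) - \hat{\bar{\psi}}_p\right)^\top,$$

and $\hat{\bar{\psi}}_p$ is an empirical estimator of the expectation $\bar{\psi}_p$:

$$\hat{\bar{\psi}}_p := \frac{1}{n} \sum_{i=1}^{n} \psi(x_i).$$

The true $L^2$-distance is non-negative by definition (see Eq.(8)), but the above
bias-corrected estimate can take a negative value. Following the same line as Baranchik (1964),
the positive-part estimator may be more accurate:

$$\overline{L}^2(\mathcal{X}, \mathcal{X}') := \max\left\{0, \widetilde{L}^2(\mathcal{X}, \mathcal{X}')\right\}.$$

However, in our preliminary experiments, $\overline{L}^2(\mathcal{X}, \mathcal{X}')$ does not always perform well,
particularly when $H$ is ill-conditioned. For this reason, we practically propose to use
$\hat{L}^2(\mathcal{X}, \mathcal{X}')$ defined by Eq.(11).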
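For illustration, the bias-corrected estimator and its positive part can be sketched as below. The sketch assumes a small, explicitly supplied set of kernel centers so that $H$ stays reasonably well-conditioned (the text notes that the correction is unreliable otherwise), and it assumes the empirical covariance uses $1/n$ normalization; the function name and these choices are illustrative, not taken from the authors' implementation.

```python
import numpy as np

def _sq(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def bias_corrected_l2(x, x_prime, centers, sigma, lam):
    """Return (Eq.(11) estimate, Eq.(15) estimate, positive part of Eq.(15))."""
    n, n_prime = len(x), len(x_prime)
    d = centers.shape[1]
    H = (np.pi * sigma ** 2) ** (d / 2.0) * np.exp(-_sq(centers, centers) / (4.0 * sigma ** 2))
    psi_x = np.exp(-_sq(x, centers) / (2.0 * sigma ** 2))
    psi_xp = np.exp(-_sq(x_prime, centers) / (2.0 * sigma ** 2))
    h_hat = psi_x.mean(axis=0) - psi_xp.mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h_hat)
    plain = 2.0 * (h_hat @ theta) - theta @ H @ theta
    # Empirical covariance matrices of the basis functions (bias=True gives 1/n scaling).
    V_p = np.cov(psi_x, rowvar=False, bias=True)
    V_pp = np.cov(psi_xp, rowvar=False, bias=True)
    # tr(H^{-1} (V_p / n + V_p' / n')), computed with a solve instead of an inverse.
    correction = np.trace(np.linalg.solve(H, V_p / n + V_pp / n_prime))
    corrected = plain - correction
    return plain, corrected, max(0.0, corrected)

rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=(100, 1))
x_prime = rng.normal(0.0, 1.0, size=(100, 1))
centers = np.linspace(-1.0, 1.5, 6)[:, None]  # assumed small set of centers
plain, corrected, positive_part = bias_corrected_l2(x, x_prime, centers, sigma=0.5, lam=0.01)
```

Because the trace term is non-negative when $H$ is positive definite, the corrected value never exceeds the plain estimate of Eq.(11).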

4 Experiments 

In this section, we experimentally evaluate the performance of LSDD. 
4.1 Numerical Examples

First, we show numerical examples using artificial datasets.

4.1.1 LSDD vs. KDE

We experimentally compare the behavior of LSDD and the KDE-based method. Let

$$p(x) = N(x; (\mu, 0, \ldots, 0)^\top, (4\pi)^{-1} I_d),$$
$$p'(x) = N(x; (0, 0, \ldots, 0)^\top, (4\pi)^{-1} I_d),$$

where $N(x; \mu, \Sigma)$ denotes the multi-dimensional normal density with mean vector $\mu$ and
variance-covariance matrix $\Sigma$ with respect to $x$, and $I_d$ denotes the $d$-dimensional identity
matrix.

We first illustrate how LSDD and KDE behave under $d = 1$ and $n = n' = 200$. Figure 1
depicts the data samples, densities and density difference estimated by KDE, and density
difference estimated by LSDD for $\mu = 0$ (i.e., $f(x) = p(x) - p'(x) = 0$). This shows that
LSDD gives a more accurate estimate of the density difference $f(x)$ than KDE. Figure 2
depicts the results for $\mu = 0.5$ (i.e., $f(x) \neq 0$), showing again that LSDD performs well.

Next, we compare the $L^2$-distance approximator based on LSDD and that based on
KDE. For $\mu = 0, 0.2, 0.4, 0.6, 0.8$ and $d = 1, 5$, we draw $n = n' = 200$ samples from
the above $p(x)$ and $p'(x)$. Figure 3 depicts the mean and standard error of estimated
$L^2$-distances over 100 runs as functions of mean $\mu$. When $d = 1$, the LSDD-based
$L^2$-distance estimator gives accurate estimates of the true $L^2$-distance, whereas the
KDE-based $L^2$-distance estimator slightly underestimates the true $L^2$-distance. This is caused
by the fact that KDE tends to provide smoother density estimates (see Figure 2(c) again).
Such smoother density estimates are accurate as density estimates, but the difference of
smoother density estimates yields a smaller $L^2$-distance estimate (Anderson et al., 1994).
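As a side calculation (not stated in the text above), the true $L^2$-distance in this setup has a simple closed form. It follows from the standard Gaussian product integral $\int N(x; a, s^2 I_d)\, N(x; b, s^2 I_d)\, \mathrm{d}x = (4\pi s^2)^{-d/2} \exp\left(-\|a - b\|^2 / (4 s^2)\right)$ with $s^2 = (4\pi)^{-1}$, for which $4\pi s^2 = 1$:

$$L^2(p, p') = \int p(x)^2 \mathrm{d}x + \int p'(x)^2 \mathrm{d}x - 2 \int p(x) p'(x) \mathrm{d}x = 1 + 1 - 2 e^{-\pi\mu^2} = 2\left(1 - e^{-\pi\mu^2}\right).$$

Under this calculation the target value is $0$ at $\mu = 0$ and does not depend on the dimensionality $d$, which makes the $d = 1$ and $d = 5$ results directly comparable.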
Figure 1: Estimation of density difference when $\mu = 0$ (i.e., $f(x) = p(x) - p'(x) = 0$).

Figure 2: Estimation of density difference when $\mu = 0.5$ (i.e., $f(x) = p(x) - p'(x) \neq 0$).

Figure 3: $L^2$-distance estimation by LSDD and KDE for (a) $d = 1$ and (b) $d = 5$. Means and
standard errors over 100 runs are plotted.

This tendency is more significant when $d = 5$; the KDE-based $L^2$-distance estimator
severely underestimates the true $L^2$-distance, which is a typical drawback of the
two-step procedure. On the other hand, the LSDD-based $L^2$-distance estimator still gives
reasonably accurate estimates of the true $L^2$-distance even when $d = 5$.

4.1.2 $L^2$-Distance vs. KL-Divergence

The Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951) is a popular divergence
measure for comparing probability distributions. The KL-divergence from $p(x)$ to $p'(x)$
is defined as

$$\mathrm{KL}(p \| p') := \int p(x) \log \frac{p(x)}{p'(x)} \mathrm{d}x.$$

First, we illustrate the difference between the $L^2$-distance and the KL-divergence. For
$d = 1$, let

$$p(x) = (1 - \eta) N(x; 0, 1^2) + \eta N(x; \mu, (1/4)^2),$$
$$p'(x) = N(x; 0, 1^2).$$
Implications of the above densities are that samples drawn from $N(x; 0, 1^2)$ are inliers,
whereas samples drawn from $N(x; \mu, (1/4)^2)$ are outliers. We set the outlier rate at $\eta = 0.1$
and the outlier mean at $\mu = 0, 2, 4, \ldots, 10$ (see Figure 4).

Figure 5 depicts the $L^2$-distance and the KL-divergence for outlier mean $\mu =
0, 2, 4, \ldots, 10$. This shows that both the $L^2$-distance and the KL-divergence increase as $\mu$
increases. However, the $L^2$-distance is bounded from above, whereas the KL-divergence
diverges to infinity as $\mu$ tends to infinity. This result implies that the $L^2$-distance is less
sensitive to outliers than the KL-divergence, which agrees well with the observation given
in Basu et al. (1998).
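The bounded-versus-diverging behavior described above can be reproduced by numerical integration of the two definitions. The sketch below is an illustration under stated assumptions: the integration window, grid resolution, and function names are arbitrary choices, and a plain Riemann sum is used in place of a dedicated quadrature routine.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Univariate normal density with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def true_l2_and_kl(mu, eta=0.1, lo=-12.0, hi=24.0, n_grid=360001):
    """Numerically integrate the L2-distance and KL(p || p') for the outlier mixture."""
    grid = np.linspace(lo, hi, n_grid)
    dx = grid[1] - grid[0]
    p_prime = normal_pdf(grid, 0.0, 1.0)                              # inlier density
    p = (1.0 - eta) * p_prime + eta * normal_pdf(grid, mu, 0.25)      # contaminated density
    l2 = np.sum((p - p_prime) ** 2) * dx
    kl = np.sum(p * (np.log(p) - np.log(p_prime))) * dx
    return l2, kl

mus = [0, 2, 4, 6, 8, 10]
l2_vals, kl_vals = zip(*(true_l2_and_kl(mu) for mu in mus))
```

Here $p(x) - p'(x) = \eta\,(N(x; \mu, (1/4)^2) - N(x; 0, 1^2))$, so the $L^2$-distance cannot exceed $\eta^2$ times the sum of the two squared $L^2$-norms, whereas the KL-divergence grows roughly like $\eta\mu^2/2$ for large $\mu$.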

Next, we draw $n = n' = 100$ samples from $p(x)$ and $p'(x)$, and estimate the $L^2$-distance
by LSDD and the KL-divergence by the Kullback-Leibler importance estimation procedure^2
(KLIEP) (Sugiyama et al., 2008; Nguyen et al., 2010). Figure 6 depicts the estimated
$L^2$-distance and KL-divergence for outlier mean $\mu = 0, 2, 4, \ldots, 10$ over 100 runs. This
shows that both LSDD and KLIEP reasonably capture the profiles of the true $L^2$-distance
and the KL-divergence, although the scale of KLIEP values is much different from the
true values (see Figure 5) because the estimated normalization factor was unreliable.

^2 Estimation of the KL-divergence from data has been extensively studied recently (Wang et al., 2005;
Sugiyama et al., 2008; Perez-Cruz, 2008; Silva & Narayanan, 2010; Nguyen et al., 2010). Among them,
KLIEP was shown to possess a superior convergence property and demonstrated to work well in practice.
KLIEP is based on direct estimation of density ratio $p(x)/p'(x)$ without density estimation of $p(x)$ and
$p'(x)$.

Figure 4: Comparing two densities in the presence of outliers. $p(x)$ includes outliers at
$\mu = 0, 2, 4, \ldots, 10$.

Figure 5: The true $L^2$-distance and true KL-divergence as functions of outlier mean $\mu$.

Figure 6: Means and standard errors of $L^2$-distance estimation by LSDD and
KL-divergence estimation by KLIEP over 100 runs.

Figure 7: Two-sample test for outlier rate $\eta = 0.1$ as functions of outlier mean $\mu$.

Figure 8: Two-sample test for outlier mean $\mu = 10$ as functions of outlier rate $\eta$.

Figure 9: Schematic illustration of semi-supervised class-balance estimation.

Finally, based on the permutation test procedure (Efron & Tibshirani, 1993), we
conduct hypothesis testing of the null hypothesis that densities $p$ and $p'$ are the same. More
specifically, we first compute a distance estimator for the original datasets $\mathcal{X}$ and $\mathcal{X}'$ and
obtain $\hat{D}(\mathcal{X}, \mathcal{X}')$. Next, we randomly permute the $|\mathcal{X} \cup \mathcal{X}'|$ samples, and assign the first
$|\mathcal{X}|$ samples to a set $\widetilde{\mathcal{X}}$ and the remaining $|\mathcal{X}'|$ samples to another set $\widetilde{\mathcal{X}}'$. Then we
compute the distance estimator again using the randomly permuted datasets $\widetilde{\mathcal{X}}$ and $\widetilde{\mathcal{X}}'$ and
obtain $\hat{D}(\widetilde{\mathcal{X}}, \widetilde{\mathcal{X}}')$. Since $\widetilde{\mathcal{X}}$ and $\widetilde{\mathcal{X}}'$ can be regarded as being drawn from the same
distribution, $\hat{D}(\widetilde{\mathcal{X}}, \widetilde{\mathcal{X}}')$ would take a value close to zero. This random permutation procedure
is repeated many times, and the distribution of $\hat{D}(\widetilde{\mathcal{X}}, \widetilde{\mathcal{X}}')$ under the null hypothesis (i.e.,
the two distributions are the same) is constructed. Finally, the p-value is approximated
by evaluating the relative ranking of $\hat{D}(\mathcal{X}, \mathcal{X}')$ in the histogram of $\hat{D}(\widetilde{\mathcal{X}}, \widetilde{\mathcal{X}}')$. We set the
significance level at 5%.
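The permutation test procedure can be sketched as follows, using the LSDD-based estimator of Eq.(11) as the distance. This is a minimal illustration: the fixed `sigma` and `lam`, the number of permutations, and the "+1" convention in the p-value (a common way to avoid reporting exactly zero) are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def lsdd_l2(x, x_prime, sigma=1.0, lam=0.1):
    """LSDD-based L2-distance estimate of Eq.(11) with fixed (assumed) sigma and lam."""
    centers = np.vstack([x, x_prime])
    d = centers.shape[1]
    sq = lambda a, b: ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    H = (np.pi * sigma ** 2) ** (d / 2.0) * np.exp(-sq(centers, centers) / (4.0 * sigma ** 2))
    h_hat = (np.exp(-sq(x, centers) / (2.0 * sigma ** 2)).mean(axis=0)
             - np.exp(-sq(x_prime, centers) / (2.0 * sigma ** 2)).mean(axis=0))
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h_hat)
    return 2.0 * (h_hat @ theta) - theta @ H @ theta

def permutation_p_value(x, x_prime, distance, n_perm=200, seed=0):
    """Approximate p-value of the observed distance under random relabeling."""
    rng = np.random.default_rng(seed)
    observed = distance(x, x_prime)
    pooled = np.vstack([x, x_prime])
    n = len(x)
    null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(len(pooled))
        # First |X| permuted samples form one set, the remaining |X'| the other.
        null[b] = distance(pooled[perm[:n]], pooled[perm[n:]])
    # Relative rank of the observed value among the permuted values.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```

A p-value below 0.05 would lead to rejection at the 5% significance level used in the text.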

Figure 7 depicts the rejection rate of the null hypothesis for outlier rate $\eta = 0.1$ and
outlier mean $\mu = 0, 2, 4, \ldots, 10$, based on the $L^2$-distance estimated by LSDD and the
KL-divergence estimated by KLIEP. This shows that the KLIEP-based test rejects the
null hypothesis more frequently for large $\mu$, whereas the rejection rate of the LSDD-based
test is kept almost constant even when $\mu$ is changed. This result implies that the
two-sample test by LSDD is more robust against outliers (i.e., two distributions tend to be
regarded as the same even in the presence of outliers) than the KLIEP-based test.

Figure 8 depicts the rejection rate of the null hypothesis for outlier mean $\mu = 10$ for
outlier rate $\eta = 0, 0.05, 0.1, \ldots, 0.35$. When $\eta = 0$ (i.e., no outliers), both the LSDD-based
test and the KLIEP-based test accept the null hypothesis with approximately the designated
significance level. When $\eta = 0.1$, the LSDD-based test still keeps a low rejection rate,
whereas the KLIEP-based test tends to reject the null hypothesis. When $\eta \ge 0.3$, the
LSDD-based test and the KLIEP-based test tend to reject the null hypothesis in a similar
way.
Figure 10: Results of semi-supervised class-balance estimation. Left: Squared error of
class balance estimation. Right: Misclassification error by a weighted regularized
least-squares classifier.

4.2 Applications

Next, we apply LSDD to semi-supervised class-balance estimation under class prior change 
and change-point detection in time series. 

4.2.1 Semi-Supervised Class-Balance Estimation 

In real-world pattern recognition tasks, changes in class balance are often observed. Then 
significant estimation bias can be caused since the class balance in the training dataset 
does not reflect that of the test dataset. 

Here, we consider a pattern recognition task of classifying pattern $x \in \mathbb{R}^d$ to class
$y \in \{+1, -1\}$. Our goal is to learn the class balance of a test dataset in a semi-supervised
learning setup where unlabeled test samples are provided in addition to labeled training
samples (Chapelle et al., 2006). The class balance in the test set can be estimated by
matching a mixture of class-wise training input densities,

$$\pi p_{\mathrm{train}}(x | y = +1) + (1 - \pi) p_{\mathrm{train}}(x | y = -1),$$

with the test input density $p_{\mathrm{test}}(x)$ (Saerens et al., 2002), where $\pi \in [0, 1]$ is a mixing
coefficient to learn. See Figure 9 for a schematic illustration. Here, we use the $L^2$-distance
estimated by LSDD and the difference of KDEs for this distribution matching.

We use four UCI benchmark datasets^3, where we randomly choose 20 labeled training
samples from each class and 50 unlabeled test samples following true class-prior $\pi^* =
0.1, 0.2, \ldots, 0.9$. Figure 10 plots the mean and standard error of the squared difference
between true and estimated class balances $\pi$ and the misclassification error by a weighted
regularized least-squares classifier (Rifkin et al., 2003) over 1000 runs. The results show
that LSDD tends to provide better class-balance estimates, which are translated into
lower classification errors.

^3 http://archive.ics.uci.edu/ml/

Figure 11: Schematic illustration of unsupervised change detection.



4.2.2 Unsupervised Change Detection 



The objective of change detection is to discover abrupt property changes behind time- 
series data. 

Let $y(t) \in \mathbb{R}^m$ be an $m$-dimensional time-series sample at time $t$, and let

$$Y(t) := [y(t)^\top, y(t+1)^\top, \ldots, y(t+k-1)^\top]^\top \in \mathbb{R}^{km}$$

be a subsequence of time series at time $t$ with length $k$. We treat the subsequence $Y(t)$
as a sample, instead of a single point $y(t)$, by which time-dependent information can be
incorporated naturally (Kawahara & Sugiyama, 2012). Let $\mathcal{Y}(t)$ be a set of $r$ retrospective
subsequence samples starting at time $t$:

$$\mathcal{Y}(t) := \{Y(t), Y(t+1), \ldots, Y(t+r-1)\}.$$



Figure 12: Results of unsupervised change detection for (a) speech data and (b) accelerometer
data. Top: Original time-series. Middle: Change scores obtained by KLIEP. Bottom: Change
scores obtained by LSDD.

Our strategy is to compute a certain dissimilarity measure between two consecutive
segments $\mathcal{Y}(t)$ and $\mathcal{Y}(t + r)$, and use it as the plausibility of change points (see Figure 11). As
a dissimilarity measure, we use the $L^2$-distance estimated by LSDD and the KL-divergence
estimated by the KL importance estimation procedure (KLIEP) (Sugiyama et al., 2008;
Nguyen et al., 2010). We set $k = 5$ and $r = 50$.
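The subsequence construction and the sliding comparison of consecutive segments can be sketched as follows. The `dissimilarity` argument is a placeholder for any two-sample measure such as the LSDD-based $L^2$-distance; the `mean_shift` function used in the demonstration is only a simple stand-in assumed for this sketch, not LSDD or KLIEP, and the step signal is synthetic.

```python
import numpy as np

def subsequences(y, k):
    """Stack k consecutive samples y(t), ..., y(t+k-1) into one row vector Y(t)."""
    y = np.asarray(y, dtype=float)
    if y.ndim == 1:
        y = y[:, None]  # treat a 1-D series as m = 1
    n_sub = len(y) - k + 1
    return np.stack([y[t:t + k].ravel() for t in range(n_sub)])

def change_scores(y, k, r, dissimilarity):
    """Score at t: dissimilarity between the segments starting at t and at t + r."""
    Y = subsequences(y, k)
    n_scores = len(Y) - 2 * r + 1
    return np.array([dissimilarity(Y[t:t + r], Y[t + r:t + 2 * r]) for t in range(n_scores)])

def mean_shift(a, b):
    """Stand-in dissimilarity (assumption): squared distance between segment means."""
    return float(np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2))

# Synthetic series with a single change at time 100.
y = np.concatenate([np.zeros(100), np.ones(100)])
scores = change_scores(y, k=5, r=20, dissimilarity=mean_shift)
peak_time = int(np.argmax(scores)) + 20  # score at t compares data before and after t + r
```

With this indexing, a peak in the score at index $t$ points to a change near time $t + r$, the boundary between the two compared segments.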

First, we use the IPSJ SIG-SLP Corpora and Environments for Noisy Speech
Recognition (CENSREC) dataset^4 provided by the National Institute of Informatics, Japan,
which records human voice in a noisy environment such as a restaurant. The top graph
in Figure 12(a) displays the original time-series, where true change points were manually
annotated. The bottom two graphs in Figure 12(a) plot change scores obtained by KLIEP
and LSDD, showing that the LSDD-based change score indicates the existence of change
points more clearly than the KLIEP-based change score.

Next, we use a dataset taken from the Human Activity Sensing Consortium (HASC)
challenge 2011^5, which provides human activity information collected by portable
three-axis accelerometers. Because the orientation of the accelerometers is not necessarily fixed,
we take the $\ell_2$-norm of the 3-dimensional data. The top graph in Figure 12(b) displays
the original time-series for a sequence of actions "jog", "stay", "stair down", "stay", and
"stair up" (there exist 4 change points at times 540, 1110, 1728, and 2286). The bottom
two graphs in Figure 12(b) depict the change scores obtained by KLIEP and LSDD,
showing that the LSDD score is much more stable and interpretable than the KLIEP
score.

^4 http://research.nii.ac.jp/src/en/CENSREC-1-C.html
^5 http://hasc.jp/hc2011/

5 Conclusions

In this paper, we proposed a method for directly estimating the difference between two
probability density functions without density estimation. The proposed method, called
the least-squares density-difference (LSDD), was derived within a framework of kernel
least-squares estimation, and its solution can be computed analytically in a computationally
efficient and stable manner. Furthermore, LSDD is equipped with cross-validation,
and thus all tuning parameters such as the kernel width and the regularization parameter
can be systematically and objectively optimized. We showed the asymptotic normality
of LSDD in a parametric setup and derived a finite-sample error bound for LSDD in a
non-parametric setup. In both cases, LSDD achieves the optimal convergence rate.

We also proposed an $L^2$-distance estimator based on LSDD, which nicely cancels a
bias caused by regularization. The LSDD-based $L^2$-distance estimator was experimentally
shown to be more accurate than the difference of kernel density estimators and more robust
against outliers than Kullback-Leibler divergence estimation.

Density-difference estimation is a novel research paradigm in machine learning, and
we have given a simple but useful method for this emerging topic. Our future work will
develop more powerful algorithms for density-difference estimation and explore a variety
of applications.
A Technical Details of Non-Parametric Convergence
Analysis in Section 2.3.2

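First, we define linear operators $P_n, P, P'_n, P', Q_n, Q$ as

$$P_n f := \frac{1}{n} \sum_{i=1}^{n} f(x_i), \qquad P f := \int f(x) p(x) \mathrm{d}x,$$
$$P'_n f := \frac{1}{n} \sum_{i=1}^{n} f(x'_i), \qquad P' f := \int f(x) p'(x) \mathrm{d}x,$$
$$Q_n f := P_n f - P'_n f, \qquad Q f := P f - P' f.$$

Let $\mathcal{H}_\gamma$ be an RKHS endowed with the Gaussian kernel with width $\gamma$:

$$k_\gamma(x, x') = \exp\left(-\frac{\|x - x'\|^2}{\gamma^2}\right).$$

A density-difference estimator $\hat{f}$ is obtained as

$$\hat{f} := \operatorname*{argmin}_{f \in \mathcal{H}_\gamma} \left[ \|f\|_{L^2(\mathbb{R}^d)}^2 - 2 Q_n f + \lambda \|f\|_{\mathcal{H}_\gamma}^2 \right].$$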



We assume the following conditions:

Assumption 1. The densities are bounded: There exists $M$ such that

$$\|p\|_\infty \le M \quad \text{and} \quad \|p'\|_\infty \le M.$$

The density difference $f = p - p'$ is a member of Besov space with regularity $\alpha$:
$f \in B^\alpha_{2,\infty}$ and, for $r = \lfloor\alpha\rfloor + 1$ where $\lfloor\alpha\rfloor$ denotes the largest integer less than or
equal to $\alpha$,

$$\|f\|_{B^\alpha_{2,\infty}} := \|f\|_{L^2(\mathbb{R}^d)} + \sup_{t > 0} \left( t^{-\alpha} \omega_{r, L^2(\mathbb{R}^d)}(f, t) \right) < c,$$

where $B^\alpha_{2,\infty}$ is the Besov space with regularity $\alpha$ and $\omega_{r, L^2(\mathbb{R}^d)}$ is the $r$-th modulus of
smoothness (see Eberts and Steinwart (2011) for the definitions).

Then we have the following theorem. 

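Theorem 2. Suppose Assumption 1 is satisfied. Then, for all $\epsilon > 0$ and $p \in (0, 1)$, there
exists a constant $K > 0$ depending on $M, c, \epsilon, p$ such that for all $n \ge 1$, $\tau \ge 1$, and $\lambda > 0$,
the LSDD estimator $\hat f$ in $\mathcal{H}_\gamma$ satisfies

\[
\|\hat f - f\|^2_{L^2(\mathbb{R}^d)} + \lambda \|\hat f\|^2_{\mathcal{H}_\gamma}
\le K \left( \lambda \gamma^{-d} + \gamma^{2\alpha}
+ \frac{\gamma^{-(1-p)(1+\epsilon)d}}{\lambda^{p} n}
+ \frac{\gamma^{-\frac{2(1-p)d}{1+p}\left(1+\epsilon+\frac{1-p}{4}\right)}}{\lambda^{\frac{3p-p^2}{1+p}}\, n^{\frac{2}{1+p}}}
+ \frac{\tau}{n^2 \lambda} + \frac{\tau}{n} \right)
\]

with probability not less than $1 - 4e^{-\tau}$.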

To prove this, we utilize the technique developed in Eberts and Steinwart (2011) for 
a regression problem. 

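Proof. First, note that, since $\hat f$ minimizes the objective over $\mathcal{H}_\gamma$, for any $f_0 \in \mathcal{H}_\gamma$ (a specific choice is made below),

\[
\|\hat f\|^2_{L^2(\mathbb{R}^d)} - 2 Q_n \hat f + \|f\|^2_{L^2(\mathbb{R}^d)} + \lambda \|\hat f\|^2_{\mathcal{H}_\gamma}
\le \|f_0\|^2_{L^2(\mathbb{R}^d)} - 2 Q_n f_0 + \|f\|^2_{L^2(\mathbb{R}^d)} + \lambda \|f_0\|^2_{\mathcal{H}_\gamma}.
\]

Therefore, we have

\[
\begin{aligned}
\|\hat f - f\|^2_{L^2(\mathbb{R}^d)} + \lambda \|\hat f\|^2_{\mathcal{H}_\gamma}
&= \|\hat f\|^2_{L^2(\mathbb{R}^d)} - 2 Q_n \hat f + \|f\|^2_{L^2(\mathbb{R}^d)} + 2 (Q_n - Q) \hat f + \lambda \|\hat f\|^2_{\mathcal{H}_\gamma} \\
&\le \|f_0\|^2_{L^2(\mathbb{R}^d)} - 2 Q_n f_0 + \|f\|^2_{L^2(\mathbb{R}^d)} + 2 (Q_n - Q) \hat f + \lambda \|f_0\|^2_{\mathcal{H}_\gamma} \\
&= \|f_0\|^2_{L^2(\mathbb{R}^d)} - 2 Q f_0 + \|f\|^2_{L^2(\mathbb{R}^d)} + 2 (Q_n - Q)(\hat f - f_0) + \lambda \|f_0\|^2_{\mathcal{H}_\gamma} \\
&= \|f_0 - f\|^2_{L^2(\mathbb{R}^d)} + 2 (Q_n - Q)(\hat f - f) + 2 (Q_n - Q)(f - f_0) + \lambda \|f_0\|^2_{\mathcal{H}_\gamma}.
\end{aligned}
\tag{16}
\]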
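
The first equality in the chain leading to Eq.(16) uses only the definition f = p - p'. As a worked step (writing $\hat f$ for the estimator and $\langle \cdot, \cdot \rangle$ for the $L^2(\mathbb{R}^d)$ inner product, a notation assumed here for illustration):

```latex
\[
\langle \hat f, f \rangle
= \int \hat f(x)\,\bigl(p(x) - p'(x)\bigr)\,dx
= P \hat f - P' \hat f
= Q \hat f,
\]
\[
\|\hat f - f\|^2_{L^2(\mathbb{R}^d)}
= \|\hat f\|^2_{L^2(\mathbb{R}^d)} - 2 Q \hat f + \|f\|^2_{L^2(\mathbb{R}^d)}
= \|\hat f\|^2_{L^2(\mathbb{R}^d)} - 2 Q_n \hat f + \|f\|^2_{L^2(\mathbb{R}^d)} + 2 (Q_n - Q) \hat f .
\]
```

This is why the unknown density difference enters the criterion only through a term that can be approximated by sample averages.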
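Let

\[
K(x) := \sum_{j=1}^{r} \binom{r}{j} (-1)^{1-j} \frac{1}{j^d} \left( \frac{2}{\gamma \sqrt{\pi}} \right)^{\frac{d}{2}} \exp\left( -\frac{2 \|x\|^2}{j^2 \gamma^2} \right)
\]

and $\tilde f(x) := (\gamma \sqrt{\pi})^{-\frac{d}{2}} f(x)$. Using $K$ and $\tilde f$, we define

\[
f_0 := K * \tilde f = \int \tilde f(y)\, K(x - y)\, dy,
\]

i.e., $f_0$ is the convolution of $K$ and $\tilde f$. Because of Lemma 2 in Eberts and Steinwart
(2011), we have $f_0 \in \mathcal{H}_\gamma$ and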



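\[
\begin{aligned}
\|f_0\|_{\mathcal{H}_\gamma}
&\le (2^r - 1) \|\tilde f\|_{L^2(\mathbb{R}^d)} \quad (\because \text{Lemma 2 of Eberts and Steinwart (2011)}) \\
&\le (2^r - 1) (\gamma \sqrt{\pi})^{-\frac{d}{2}} \|f\|_{L^2(\mathbb{R}^d)} \\
&\le (2^r - 1) (\gamma \sqrt{\pi})^{-\frac{d}{2}} \left( \|p\|_{L^2(\mathbb{R}^d)} + \|p'\|_{L^2(\mathbb{R}^d)} \right) \\
&\le (2^r - 1) (\gamma \sqrt{\pi})^{-\frac{d}{2}}\, 2 \sqrt{M}.
\end{aligned}
\tag{17}
\]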
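
The last inequality in Eq.(17) relies on the boundedness of the densities in Assumption 1. A short worked step, using only that p is a probability density with sup-norm at most M (and likewise for p'):

```latex
\[
\|p\|^2_{L^2(\mathbb{R}^d)}
= \int p(x)^2\,dx
\le \|p\|_\infty \int p(x)\,dx
\le M
\quad\Longrightarrow\quad
\|p\|_{L^2(\mathbb{R}^d)} \le \sqrt{M},
\]
\[
\|f\|_{L^2(\mathbb{R}^d)}
= \|p - p'\|_{L^2(\mathbb{R}^d)}
\le \|p\|_{L^2(\mathbb{R}^d)} + \|p'\|_{L^2(\mathbb{R}^d)}
\le 2\sqrt{M}.
\]
```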



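Moreover, Lemma 3 in Eberts and Steinwart (2011) gives

\[
\|f_0\|_\infty \le (2^r - 1) \|f\|_\infty \le (2^r - 1) M,
\tag{18}
\]

and Lemma 1 in Eberts and Steinwart (2011) yields that there exists a constant $C_{r,2}$ such
that

\[
\|f_0 - f\|^2_{L^2(\mathbb{R}^d)} \le C_{r,2}\, \omega^2_{r,L^2(\mathbb{R}^d)}\!\left(f, \frac{\gamma}{2}\right) \le C_{r,2}\, c^2 \gamma^{2\alpha}.
\tag{19}
\]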

Now, following a similar line to Theorem 3 in Eberts and Steinwart (2011), we can
show that, for all $\epsilon > 0$ and $p \in (0, 1)$, there exists a constant $C_{\epsilon,p}$ such that



|(p„-p)(/- /)!</-/. 
To bound this, we derive the tail probability of



f-f 



where $r > 0$ is a positive real such that $r > r^*$ for



r* = ^^J\f-f\\hm + M\f\\n,- 



Let 



f-f 



11/ + ^11/11^ +^ 



for $f \in \mathcal{H}_\gamma$ and $r > r^*$. Then we have



\\9f,r 



< 



< 



+ 



oo I WJ oo 



< 



ll/-/IIV) + A||/||?,, + r 

||/-/|lC) + A||/||?,^+r 

I M 

+ —< 



1 M 
+ — , 



A||/lk + r/||/|k r - 2v^ r 



and 



Pg 



P{f-ff 



< 



^'^ (11/ - /lli.(M.) + Mlffn, + rY ^ (11/ - /||i.(M.) + A||/||2,^ + r)2 



Here, let 



Tr := {/ e I 11/ - fWl,,^,, + A||/||^^ < r}. 



and we assume that there exists a function $\phi_n$ such that



E 



sup|(P„-P)(/-/)| 
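where $\mathbb{E}$ denotes the expectation over all samples. Then, by the peeling device (see
Theorem 7.7 in Steinwart & Christmann, 2008), we have

\[
\mathbb{E}\left[ \sup_{f \in \mathcal{H}_\gamma} |(P_n - P) g_{f,r}| \right] \le \frac{8 \phi_n(r)}{r}.
\]

Therefore, by Talagrand's concentration inequality, we have

\[
\Pr\left[ \sup_{f \in \mathcal{H}_\gamma} |(P_n - P) g_{f,r}| \le \frac{10 \phi_n(r)}{r} + \sqrt{\frac{2 M \tau}{n r}} + \frac{14 \tau}{3 n} \left( \frac{1}{2 \sqrt{r \lambda}} + \frac{M}{r} \right) \right] \ge 1 - e^{-\tau},
\tag{20}
\]

where $\Pr[\cdot]$ denotes the probability of an event.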

From now on, we give an upper bound of $\phi_n$. The RKHS $\mathcal{H}_\gamma$ can be embedded in
an arbitrary Sobolev space $W^m(\mathbb{R}^d)$. Indeed, by the proof of Theorem 3.1 in Steinwart and
Scovel (2004), we have



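for all $f \in \mathcal{H}_\gamma$. Moreover, the theories of interpolation spaces give that, for all
$f \in W^m(\mathbb{R}^d)$, the supremum norm of $f$ can be bounded as

\[
\|f\|_\infty \le C_m \|f\|^{1 - \frac{d}{2m}}_{L^2(\mathbb{R}^d)} \|f\|^{\frac{d}{2m}}_{W^m(\mathbb{R}^d)}
\]

if $d < 2m$. Here we set $m = \frac{d}{2p}$. Then we have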



<^;ii/iii;fM.)ii/ir«,7-'^. 



Now, since J> C {r / XY^'^B^^ and 



P(/-/)'<M||/-/||i.(^.)<Mr for/Gj; 
hold, from Theorem 7.16 and Theorem 7.34 in Steinwart and Christmann (2008), we can
take



/ \ I ^ (i-p)(i+,)d /r\ 2 , _-,/2 

(^„(r) =max<( Ci,p,e7 ^ [j) (Mr) ^ n 



(l-p)(l+e)d /r\T+i 



r\2 d(l-p) 1-p 

— 17 * 7" 2 

A/ 



i+p 



n 



-V(i+p) 



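where $\epsilon > 0$ and $p \in (0, 1)$ are arbitrary and $C_{1,p,\epsilon}, C_{2,p,\epsilon}$ are constants depending on $p, \epsilon$.
In the same way, we can also obtain a bound of $\sup_{f \in \mathcal{H}_\gamma} |(P'_n - P') g_{f,r}|$.
If we set $r$ to satisfy

\[
\frac{1}{8} \ge \frac{10 \phi_n(r)}{r} + \sqrt{\frac{2 M \tau}{n r}} + \frac{14 \tau}{3 n} \left( \frac{1}{2 \sqrt{r \lambda}} + \frac{M}{r} \right),
\tag{21}
\]

then we have

\[
|(Q_n - Q)(\hat f - f)| \le \frac{1}{4} \left( r + \|\hat f - f\|^2_{L^2(\mathbb{R}^d)} + \lambda \|\hat f\|^2_{\mathcal{H}_\gamma} \right),
\tag{22}
\]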



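with probability $1 - 2e^{-\tau}$. To satisfy Eq.(21), it suffices to set

\[
r = C \left( \frac{\gamma^{-(1-p)(1+\epsilon)d}}{\lambda^{p} n}
+ \frac{\gamma^{-\frac{2(1-p)d}{1+p}\left(1+\epsilon+\frac{1-p}{4}\right)}}{\lambda^{\frac{3p-p^2}{1+p}}\, n^{\frac{2}{1+p}}}
+ \frac{\tau}{n^2 \lambda} + \frac{\tau}{n} \right),
\tag{23}
\]

where $C$ is a sufficiently large constant depending on $M, \epsilon, p$.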

Finally, we bound the term $(Q_n - Q)(f_0 - f)$. By Bernstein's inequality, we have

\iPn - P){f0 -f)\<C (11/ - /o|U,(p)yi + ^) 

/ 2^' Mr 

< C i V2M\\f - JoWlHr^J- + 



^ C {\\f ~ /o|lL2(Kd) 



2Mr 2^Mr 
+ + 



n 



n 



(24) 
with probability $1 - e^{-\tau}$, where $C$ is a universal constant. In a similar way, we can also
obtain

\{PL - nifo -f)\<c (11/ - /oiii.(,.) + ^ + ^) . 

Combining these inequalities, we have

\[
|(Q_n - Q)(f_0 - f)| \le C \left( \|f - f_0\|^2_{L^2(\mathbb{R}^d)} + \frac{\tau}{n} \right),
\tag{25}
\]

with probability $1 - 2e^{-\tau}$, where $C$ is a universal constant.
Substituting Eqs.(22) and (25) into Eq.(16), we have

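\[
\|\hat f - f\|^2_{L^2(\mathbb{R}^d)} + \lambda \|\hat f\|^2_{\mathcal{H}_\gamma}
\le 2 \left\{ \|f_0 - f\|^2_{L^2(\mathbb{R}^d)} + C \left( \|f - f_0\|^2_{L^2(\mathbb{R}^d)} + \frac{\tau}{n} \right) + r + \lambda \|f_0\|^2_{\mathcal{H}_\gamma} \right\},
\]

with probability $1 - 4e^{-\tau}$. Moreover, by Eqs.(19) and (17), the right-hand side is further
bounded by

\[
\|\hat f - f\|^2_{L^2(\mathbb{R}^d)} + \lambda \|\hat f\|^2_{\mathcal{H}_\gamma} \le C \left\{ \gamma^{2\alpha} + r + \lambda \gamma^{-d} + \frac{\tau}{n} \right\}.
\]

Finally, substituting (23) into the right-hand side, we have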

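\[
\|\hat f - f\|^2_{L^2(\mathbb{R}^d)} + \lambda \|\hat f\|^2_{\mathcal{H}_\gamma}
\le K \left( \lambda \gamma^{-d} + \gamma^{2\alpha}
+ \frac{\gamma^{-(1-p)(1+\epsilon)d}}{\lambda^{p} n}
+ \frac{\gamma^{-\frac{2(1-p)d}{1+p}\left(1+\epsilon+\frac{1-p}{4}\right)}}{\lambda^{\frac{3p-p^2}{1+p}}\, n^{\frac{2}{1+p}}}
+ \frac{\tau}{n^2 \lambda} + \frac{\tau}{n} \right)
\]

with probability $1 - 4e^{-\tau}$ for $\tau \ge 1$. This gives the assertion. □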
If we set

\[
\lambda = n^{-\frac{2\alpha+d}{(2\alpha+d)(1+p)+(\epsilon-p+\epsilon p)}}, \qquad
\gamma = n^{-\frac{1}{(2\alpha+d)(1+p)+(\epsilon-p+\epsilon p)}},
\]

and take $\epsilon, p$ sufficiently small, then we immediately have the following corollary.

Corollary 1. Suppose Assumption 1 is satisfied. Then, for all $\rho, \rho' > 0$, there exists
a constant $K > 0$ depending on $M, c, \rho, \rho'$ such that for all $n \ge 1$, $\tau \ge 1$, the density-difference
estimator $\hat f$ with appropriate choice of $\gamma$ and $\lambda$ satisfies

\[
\|\hat f - f\|^2_{L^2(\mathbb{R}^d)} + \lambda \|\hat f\|^2_{\mathcal{H}_\gamma} \le K \left( n^{-\frac{2\alpha}{2\alpha+d}+\rho} + \tau n^{-1+\rho'} \right),
\tag{26}
\]

with probability not less than $1 - 4e^{-\tau}$.

Note that $n^{-\frac{2\alpha}{2\alpha+d}}$ is the optimal learning rate to estimate a function in $B^{\alpha}_{2,\infty}$ (Eberts
& Steinwart, 2011). Therefore, the density-difference estimator with a Gaussian kernel
achieves the optimal learning rate by appropriately choosing the regularization parameter
and the Gaussian width. Because the learning rate depends on $\alpha$, the LSDD estimator
is adaptive to the smoothness of the true function.
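
To make the rate concrete, the sketch below evaluates the exponent 2α/(2α+d) and the tuning schedule for λ and γ as transcribed from the text above. The function names are hypothetical, and the schedule is a direct transcription rather than an independently verified choice.

```python
def optimal_rate_exponent(alpha, d):
    # Exponent a in the rate n^{-a}, with a = 2*alpha / (2*alpha + d).
    return 2.0 * alpha / (2.0 * alpha + d)

def tuning_schedule(n, alpha, d, p, eps):
    # lambda and gamma as functions of n, transcribed from the text above.
    denom = (2.0 * alpha + d) * (1.0 + p) + (eps - p + eps * p)
    lam = n ** (-(2.0 * alpha + d) / denom)
    gamma = n ** (-1.0 / denom)
    return lam, gamma
```

With this schedule the two leading terms λγ^{-d} and γ^{2α} coincide, and as ε and p tend to 0 both approach n^{-2α/(2α+d)}. Smoother density differences (larger α) give an exponent closer to 1, while higher dimension d slows the rate.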

Our analysis heavily relies on the techniques developed in Eberts and Steinwart (2011)
for a regression problem. The main difference is that the analysis in their paper involves
a clipping procedure, which stems from the fact that the analyzed estimator requires an
empirical approximation of the expectation of the square term. The Lipschitz continuity of
the square function $f \mapsto f^2$ is utilized to investigate this term, and the clipping procedure
is used to ensure the Lipschitz continuity. On the other hand, in the current paper, we
can exactly compute $\|f\|^2_{L^2(\mathbb{R}^d)}$ so that we do not need the Lipschitz continuity.

B Derivation of Eq.(13) 

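When $\lambda\ (> 0)$ is small, $(H + \lambda I_b)^{-1}$ can be expanded as

\[
(H + \lambda I_b)^{-1} = H^{-1} - \lambda H^{-2} + o_p(\lambda),
\]

where $o_p$ denotes the probabilistic order. Then Eq.(12) can be expressed as

\[
\begin{aligned}
\beta \hat h^\top \hat\theta + (1 - \beta) \hat\theta^\top H \hat\theta
&= \beta \hat h^\top (H + \lambda I_b)^{-1} \hat h + (1 - \beta) \hat h^\top (H + \lambda I_b)^{-1} H (H + \lambda I_b)^{-1} \hat h \\
&= \beta \hat h^\top H^{-1} \hat h - \lambda \beta \hat h^\top H^{-2} \hat h \\
&\quad + (1 - \beta) \hat h^\top H^{-1} \hat h - 2 \lambda (1 - \beta) \hat h^\top H^{-2} \hat h + o_p(\lambda) \\
&= \hat h^\top H^{-1} \hat h - \lambda (2 - \beta) \hat h^\top H^{-2} \hat h + o_p(\lambda),
\end{aligned}
\]

which concludes the proof.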
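
The expansion can be checked numerically on a small synthetic problem. The sketch below is only an illustration: the matrix `H`, the vector `h`, and both function names are made up for the check and are not quantities from the paper.

```python
import numpy as np

def regularized_quadratic(H, h, lam, beta):
    # beta * h^T theta + (1 - beta) * theta^T H theta, theta = (H + lam I)^{-1} h.
    theta = np.linalg.solve(H + lam * np.eye(len(h)), h)
    return beta * h @ theta + (1 - beta) * theta @ H @ theta

def first_order_expansion(H, h, lam, beta):
    # h^T H^{-1} h - lam * (2 - beta) * h^T H^{-2} h, for symmetric H.
    Hinv_h = np.linalg.solve(H, h)
    return h @ Hinv_h - lam * (2 - beta) * (Hinv_h @ Hinv_h)

# Synthetic symmetric positive definite H (eigenvalues at least 1) and a vector h.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
H = A @ A.T + np.eye(5)
h = rng.normal(size=5)
```

With beta = 2 the first-order term in lam vanishes, which is the bias cancellation exploited by the L^2-distance estimator; for any beta the gap between the two functions shrinks quadratically in lam.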



C Derivation of Eq.(14) 



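Because $\mathbb{E}[\hat h] = h$, we have

\[
\begin{aligned}
\mathbb{E}\left[ \hat h^\top H^{-1} \hat h - h^\top H^{-1} h \right]
&= \mathbb{E}\left[ (\hat h - h)^\top H^{-1} (\hat h - h) \right] \\
&= \mathrm{tr}\left( H^{-1}\, \mathbb{E}\left[ (\hat h - h)(\hat h - h)^\top \right] \right),
\end{aligned}
\]

which concludes the proof.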
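
The identity holds exactly for any distribution of the estimate whose mean is h. The sketch below checks it on a finite, equally weighted set of candidate vectors, so the expectation is an exact average; the function name and the test data are hypothetical.

```python
import numpy as np

def quadratic_form_gap(h_hats, H):
    """Return both sides of the identity for equally likely rows of h_hats."""
    h = h_hats.mean(axis=0)  # plays the role of E[h_hat]
    Hinv = np.linalg.inv(H)
    # Left-hand side: E[h_hat^T H^{-1} h_hat] - h^T H^{-1} h.
    lhs = np.mean([v @ Hinv @ v for v in h_hats]) - h @ Hinv @ h
    # Right-hand side: tr(H^{-1} E[(h_hat - h)(h_hat - h)^T]).
    centered = h_hats - h
    cov = centered.T @ centered / len(h_hats)
    rhs = np.trace(Hinv @ cov)
    return lhs, rhs
```

The cross terms vanish because the centered vectors average to zero, which is the only property of the estimate used in the proof.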



